Ramsey theory reveals the conditions when 
sparse coding on subsampled data is unique 

Christopher J. Hillar, Friedrich T. Sommer 



Abstract — Sparse coding or dictionary learning has been widely 
used to reveal the sparse underlying structure of many kinds 
of sensory data. A related advance in signal processing is 
compressed sensing, a theory explaining how sparse data can be 
subsampled below the Nyquist-Shannon limit and then efficiently 
recovered from these subsamples. Here we study whether the 
conditions for recovery in compressed sensing are sufficient 
for dictionary learning to discover the original sparse causes 
of subsampled data. Using combinatorial Ramsey theory, we 
completely characterize when the learned dictionary matrix and 
sparse representations of subsampled data are unique (up to the 
natural equivalences of permutation and scaling). Surprisingly, 
uniqueness is shown to hold without any assumptions on the 
learned dictionaries or inferred sparse codes. Our result has 
implications for the learning of overcomplete dictionaries from 
subsampled data and has potential applications in data analysis 
and neuroscience. For instance, it identifies sparse coding as 
a possible learning mechanism for establishing lossless communication 
through severe bottlenecks, which might explain 
how different brain regions communicate through axonal fiber 
projections. 

Index Terms — Dictionary learning, sparsity, compressed sens- 
ing, sparse coding, Ramsey theory 



I. Introduction 

INDEPENDENT component analysis [1], [2] and dictionary 
learning with a sparse coding scheme [3], [4] have become 
important tools for revealing underlying structure in many 
different types of data [5], [6]. These algorithms share two 
main components: a coding step and a reconstruction step. In the 
coding step of [3], for instance, a sparse code vector b(y) with 
a small number of nonzero coordinates is computed given a 
data point y in a data set Y. The code vector b(y) is then 
used to reconstruct y as 

$$\hat{y} = B\,b(y), \tag{1}$$

using a dictionary matrix B. The code map b(y) and the 
matrix B are fit to training data using unsupervised learning. 

If reconstruction succeeds and $\hat{y} = y$ for data in Y, then 
the representations b(y) and the dictionary B capture essential 
structure in the data. On the other hand, a nonzero difference 
$\hat{y} - y$ gives an error signal to better tune the two steps of 
coding and reconstruction to the data. Such a control loop 

C. Hillar is with the Mathematical Sciences Research Institute, Berkeley, 
CA, 94720 USA e-mail: chillar@msri.org and the Redwood Center for 
Theoretical Neuroscience, Berkeley, CA, 94720 USA. Hillar was partially 
supported by an NSA Young Investigators Grant and an NSF All-Institutes 
Postdoctoral Fellowship administered by the Mathematical Sciences Research 
Institute through its core grant DMS-0441170. 

F. Sommer is with the Redwood Center for Theoretical Neuroscience, 
Berkeley, CA, 94720 USA e-mail: fsommer@berkeley.edu. 



Adaptive Compressed Sampling 




Fig. 1. In adaptive compressed sampling (ACS), signals with unknown sparse 
ambient structure a are subsampled by an unknown compressed sensing (CS) 
matrix A to form y = Aa. Unsupervised dictionary learning is then used 
to fit a sparse generative model $\hat{y} = B\,b(y)$ to the compressed data. When 
reconstruction succeeds and $\hat{y} = y$, our main result (Theorem 1) implies that 
the resulting dictionary B and sparse code vectors b are equal to the matrix 
A and sparse vectors a up to a fixed permutation and scaling; see Eqn. (6). 
Such a scheme of sparse coding might enable brain regions to communicate 
sparse feature representations through axonal fiber projections with limited 
numbers of fibers [11], [12]. 



for optimizing a coding procedure is referred to as predictive 
coding or self-supervised learning in the literature [7]. The 
sparse vector b = b(y) is also sometimes called an "efficient 
representation" of y because it exposes the coefficients of 
the "independent components" or "causes" of data, thereby 
minimizing the redundancy of the description [8], [9]. 

Thus, the columns of a learned matrix B can be interpreted 
as structural primitives inherent in the data set Y, and the 
inferred vector b(y) as specifying a sparse weighted sum of 
these primitives which reconstruct y. It has to be emphasized 
that sparse structure is a property empirically found in many 
types of sensory data, most notably natural images [3], as well 
as natural sounds and speech [4]. In the literature, predictive 
coding with a sparseness constraint is referred to as sparse 
coding or sparse dictionary learning (SDL). Importantly for 
applications, SDL can reveal overcomplete representations; 
that is, the dimension of b can exceed the dimension of 
the data y. Overcomplete representations can capture data 
mixtures that have been sparsely composed from a set of 
complete dictionaries [10]. 

A related advance in signal processing is the paradigm of 
compressed sensing (CS) [13], [14] (see also [15] for a recent 
theoretical review). The theory of CS provides a collection 






of techniques to recover data vectors x with sparse ambient 
structure after they have been linearly subsampled as $y = \Phi x$ 
by a (known) compressive matrix $\Phi$ (the number of rows n 
of $\Phi$ is significantly smaller than the number of its columns). 
Typically, the sparsity assumption enforced on x is that it can 
be expressed as $x = \Psi a$ for a fixed (known) dictionary matrix 
$\Psi$ and an (unknown) m-dimensional vector a with at most k 
nonzero components (i.e., entries). Such vectors a are called 
k-sparse. 

Surprisingly, under very mild CS conditions on the matrix 
$A = \Phi\Psi$, there are efficient and robust algorithms for 
recovering k-sparse m-dimensional a (and thus x) from the 
n-dimensional subsampled vector: 

y = Aa, (2) 

as long as the dimension of y satisfies 

$$n \geq C k \log(m/k). \tag{3}$$

Here, C is a (small) universal constant independent of m, n, 
or k.^1 In other words, one can easily recover sparse 
high-dimensional vectors a from "projections" (2) into spaces of 
dimension commensurate with the number of active coordinates 
of those vectors. 

A common CS recovery condition on $A \in \mathbb{R}^{n \times m}$ is that it 
never maps two different sparse vectors to the same vector:^2 

$$A a_1 = A a_2 \ \text{ for } k\text{-sparse } a_1, a_2 \in \mathbb{R}^m \implies a_1 = a_2. \tag{4}$$

Note that a generic square matrix A is invertible; thus, (4) 
trivially holds for almost all matrices A whenever n = m. 
In the interesting regime of compressed sensing, however, the 
sample dimension n is significantly smaller than the original 
data dimension m. Thus, a condition such as (4) supplants 
"invertibility" of the matrix A with an "incoherence" among 
every 2k of its columns. Rather remarkably, even in the critical 
regimes close to equality in (3), condition (4) holds with very 
high probability for randomly generated n x m matrices A. 
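
To make condition (4) concrete, the following minimal Python sketch checks it by brute force for a tiny matrix, using the equivalent formulation that every c columns of A are linearly independent. The function names `rank` and `satisfies_condition_4` and the example matrix are hypothetical, and we assume here that c = min(m, 2k) with m the number of columns. Since the check enumerates all column subsets, it is only feasible for toy sizes.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def satisfies_condition_4(A, k):
    """True if every min(m, 2k) columns of A are linearly independent."""
    n_cols = len(A[0])
    c = min(n_cols, 2 * k)  # assumption: cap at the number of columns
    for cols in combinations(range(n_cols), c):
        sub = [[row[j] for j in cols] for row in A]
        if rank(sub) < c:
            return False
    return True

# Hypothetical 2 x 3 example: any two columns are linearly independent.
A = [[1, 0, 1],
     [0, 1, 1]]
print(satisfies_condition_4(A, 1))  # True
print(satisfies_condition_4(A, 2))  # False: three columns in R^2 are dependent
```

Exact rational arithmetic is used so that the rank test does not depend on a floating-point tolerance.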

The theory of CS has implications for SDL. As the restriction 
(3) on the dimensions of A is necessary for successful 
coding in dictionary learning, it bounds the permitted degree 
of overcompleteness for SDL. Concretely, if (3) is violated 
so that overcompleteness exceeds $m = k\exp(n/(Ck))$, the coding 
b(y) of y in (1) cannot succeed and therefore dictionary 
learning is infeasible. Nonetheless, the theory does provide 
for an exponential degree of overcompleteness. 
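
As a rough numerical illustration, inequality (3) can be inverted to bound the dictionary size m for a given sample dimension n and sparsity k. The sketch below is only an arithmetic aid: the function name `max_dictionary_size` is hypothetical, and the value of the constant C is an assumption, since the text only states that C is a small universal constant.

```python
import math

def max_dictionary_size(n, k, C):
    """Largest integer m with n >= C*k*log(m/k), i.e. m <= k*exp(n/(C*k)).

    C is the universal constant of inequality (3); its value is an
    assumption supplied by the caller.
    """
    return math.floor(k * math.exp(n / (C * k)))

# Hypothetical values: n = 20 samples, sparsity k = 2, assumed C = 1.
m_max = max_dictionary_size(20, 2, 1.0)
print(m_max)
```

The exponential dependence on n/(Ck) is what permits a large degree of overcompleteness even for modest n.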

Here, we ask whether the necessary conditions of CS allow 
full recovery of original sparse representations using SDL. 

Problem 1 (The ACS Problem): Let $Y = \{Aa_1, \ldots, Aa_N\}$ 
be an n-dimensional data set generated linearly via (2) in 
which $A \in \mathbb{R}^{n \times m}$ satisfies compressed sensing conditions 
(3) and (4), but is unknown. Can sparse dictionary learning 
uncover the sparse causes $a_1, \ldots, a_N$ and the matrix A? 

1 For a more detailed discussion of these facts (including proofs) and 
their relationship to approximation theory and the concentration of measure 
phenomenon, we refer the reader to [16] and the references therein. 

2 Letting c = min{p, 2k}, this condition is equivalent to the assertion that 
every c columns of A are linearly independent. This condition is sometimes 
expressed as $\sigma(A) > 2k$, where $\sigma(A)$ is the smallest number of linearly 
dependent columns of A (the spark of A). 



By imposing CS conditions, Problem 1 combines dictionary 
learning with compressed sensing. We will refer to this combination 
as adaptive compressed sampling (ACS); see Fig. 1. 
Note that Problem 1 differs from blind compressed sensing, 
the problem of reconstructing the matrix $\Psi$ from subsampled 
data, which is ill-posed without additional constraints [17]. 

Problem 1 is motivated by a basic question in theoretical 
neuroscience: How can synaptic learning enable communication 
between brain regions through axonal fiber tracts that 
form severe wiring bottlenecks? One theory of communication 
through these bottlenecks is sketched in [12]. There, data 
vectors a correspond to firing patterns of m local neurons in 
a brain region and vectors y generated through (2) represent 
the activity in another set of $n \ll m$ neurons in that 
brain region, all projecting their axonal fibers to one specific 
second region. Thus, in this model, brain regions communicate 
through bottlenecks by sending subsamples of their activity. 
For decoding of these signals, the theory of [12] posits that the 
second brain region recovers the original activity using SDL. 
Thus, a positive answer to Problem 1 is crucial for this theory. 
See Section III below for more on this connection. 

Since any permutation and (component-wise) scaling of a 
sparse vector is also a sparse vector, taken literally, Problem 
1 is ill-posed [18]. More precisely, if P is a fixed permutation 
matrix^3 and D is an invertible diagonal matrix, then 

$$Aa = (A D^{-1} P^T)(P D a)$$

for each sample y = Aa. Thus, without access to A, one 
could not discriminate which of a or PDa (resp. A or 
$A D^{-1} P^T$) was the original sparse vector (resp. sampling 
matrix). Problem 1, therefore, should be interpreted as asking 
whether, up to these natural transformations (permutation and 
scaling), recovery is possible with sparse dictionary learning. 
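
The ambiguity described above can be checked numerically. The following minimal sketch, with hypothetical matrices and helper names (`matmul`, `transpose`), builds the alternative pair consisting of A D^{-1} P^T and P D a and confirms that it produces the same sample as A and a while keeping the code vector 1-sparse.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

A = [[1, 2, 0], [0, 1, 3]]             # hypothetical 2 x 3 sampling matrix
a = [[0], [5], [0]]                     # 1-sparse column vector
P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # permutation matrix
d = [Fraction(2), Fraction(-1), Fraction(4)]  # nonzero diagonal of D
D = [[d[i] if i == j else 0 for j in range(3)] for i in range(3)]
D_inv = [[1 / d[i] if i == j else 0 for j in range(3)] for i in range(3)]

A_alt = matmul(matmul(A, D_inv), transpose(P))  # A D^{-1} P^T
a_alt = matmul(matmul(P, D), a)                 # P D a

# Both pairs generate the same observed sample y.
print(matmul(A, a) == matmul(A_alt, a_alt))
```

An observer who sees only y therefore cannot tell the two pairs apart, which is why recovery is only asked for up to permutation and scaling.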

In this note, we give a complete affirmative solution to 
Problem 1. Suppose equation (2) is used to generate a set of 
compressed data $Y = \{y_1, \ldots, y_N\}$ from a (sufficiently large) 
set of N k-sparse vectors $a_1, \ldots, a_N$. These sparse vectors 
and the dictionary A are unknown to the predictive coding 
unit, but A satisfies the conditions specified in Problem 1. If 
SDL is trained with subsamples, it tries to find a coding map 
$y \mapsto b(y)$ (with k-sparse output) and a dictionary B such 
that (1) reconstructs y. We prove (in Theorem 1 below) that 
if the learning algorithm succeeds at predictive coding of the 
subsampled data: 

$$y_i = B b_i, \quad i = 1, \ldots, N; \tag{5}$$

then there is a fixed permutation matrix P and a fixed 
invertible diagonal matrix D such that 

$$A = BPD \quad \text{and} \quad b_i = P D a_i, \quad i = 1, \ldots, N. \tag{6}$$

Thus, the matrix A and the uncompressed vectors are recovered 
(up to symmetry) whenever SDL of subsampled data 
succeeds. 
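
When both A and B are available (for instance, to an outside observer in a simulation), the relation A = BPD in (6) can be tested by matching columns. The sketch below is not the construction used in our proof; it is a simple greedy check, with hypothetical names `scale_factor` and `match_columns`, that assumes no two columns of B are parallel.

```python
from fractions import Fraction

def column(M, j):
    return [row[j] for row in M]

def scale_factor(u, v):
    """Return the nonzero scalar d with u = d * v, or None if there is none."""
    d = None
    for x, y in zip(u, v):
        if y == 0:
            if x != 0:
                return None
        elif d is None:
            d = Fraction(x) / Fraction(y)
        elif Fraction(x) != d * y:
            return None
    return d if d else None  # rejects d == 0 and an all-zero v

def match_columns(A, B):
    """Find pi and d with A[:, i] = d[i] * B[:, pi[i]] (0-indexed), else None."""
    m = len(A[0])
    pi, d, used = [], [], set()
    for i in range(m):
        for j in range(m):
            if j in used:
                continue
            s = scale_factor(column(A, i), column(B, j))
            if s is not None:
                pi.append(j)
                d.append(s)
                used.add(j)
                break
        else:
            return None  # column i of A matches no unused column of B
    return pi, d

# Hypothetical example in which A = BPD holds by construction.
B = [[1, 0, 1], [0, 1, 1]]
A = [[2, -1, 0], [2, 0, 3]]
print(match_columns(A, B))
```

Here `pi[i]` gives the column of P whose single 1 sits in row `pi[i]`, and `d` lists the diagonal entries of D.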

3 A permutation matrix P has binary entries and exactly one 1 in each 
column and exactly one 1 in each row (thus, Pv for a column vector v 
permutes its entries). Note that $P P^T = P^T P = I$, where I denotes the p x p 
identity matrix, and $M^T$ for a matrix M is its transpose. Thus, $P^{-1} = P^T$. 






To understand our result in the context of ACS, consider two 
receiver regions $R_1$ and $R_2$, each having access to subsampled 
data Y. Suppose that both regions are able to reconstruct Y 
as in (5) using matrices $B_1$ and $B_2$ and code vectors $b_i^{(1)}$ 
and $b_i^{(2)}$, respectively. We remark that even if their coding 
paradigms are the same, regions $R_1$ and $R_2$ might have 
very different initial conditions for learning; thus, a priori, 
dictionaries and codes could be very different. Nonetheless, 
it follows from (6) that $A = B_1 P_1 D_1 = B_2 P_2 D_2$ for some 
permutation matrices $P_1, P_2$ and invertible diagonal matrices 
$D_1, D_2$. Thus, the two dictionaries are related via 

$$B_1 = B_2 P_2 D_2 D_1^{-1} P_1^T = B_2 (P_2 P_1^T)(P_1 D_2 D_1^{-1} P_1^T).$$

The two matrices $P = P_2 P_1^T$ and $D = P_1 D_2 D_1^{-1} P_1^T$ are 
easily checked to be a permutation and diagonal, respectively. 
Therefore, $B_1$ and $B_2$ are a permutation and scaling away from 
each other, and the same holds for $b_i^{(1)}$ and $b_i^{(2)}$. 

The most general previous result on this problem is described 
in [18]. Under the additional assumption that B 
satisfies CS condition (4) and that certain supports (indices of 
nonzero components) of inferred vectors $b_i$ occur, the authors 
of [18] show that sufficiently many equations (5) imply (6). 

In practical applications of SDL, however, one rarely has 
any guarantees on the learned dictionary matrix B or on 
the inferred sparse vectors b(y). It is also computationally 
intractable to verify a CS condition such as (4). Thus, a 
uniqueness guarantee of the sort (6) without assumptions on B 
or inferred vectors b(y) is ideal. As we shall see in the course 
of proving our main theorem, removing these assumptions is 
a technical challenge requiring methods from combinatorics, 
including the surprising use of a basic result in Ramsey theory 
[19]. This field of mathematics deals with proving statements 
of the following form: In every group of 6 people, there are 3 
people who all know each other or 3 people who all do not. 
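
The statement about 6 people is small enough to verify exhaustively. The sketch below, with hypothetical function names, encodes "know each other" versus "do not" as a 2-coloring of the pairs of people and checks every coloring; it also confirms that the statement fails for groups of 5.

```python
from itertools import combinations, product

def has_mono_triangle(n, coloring):
    """coloring maps each pair (i, j) with i < j to color 0 or 1."""
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_has_mono_triangle(n):
    """Check all 2-colorings of the pairs among n people."""
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(n, dict(zip(edges, colors)))
        for colors in product((0, 1), repeat=len(edges))
    )

print(every_coloring_has_mono_triangle(6))  # True
print(every_coloring_has_mono_triangle(5))  # False
```

For n = 6 this enumerates 2^15 colorings, which runs quickly; the approach does not scale to larger Ramsey-type statements.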

The organization of this paper is as follows. Section II provides 
the precise mathematical formulation of the uniqueness 
result (6) alluded to above. Section III gives a short discussion 
of how our results fit into the general sparse dictionary learning 
literature and theoretical neuroscience. Finally, the proof of our 



main theorem appears in Section IV where we first verify an 
instructive easier special case. (The Ramsey theory we need 
is proved in the Appendix.) To appeal to the widest audience 
possible, we have kept our mathematical arguments as self- 
contained and elementary as possible. 

II. The ACS Reconstruction Theorem 

Recall that a k-sparse vector a is a column vector $a \in \mathbb{R}^m$ 
with at most k < m nonzero entries. Our main theorem 
is concerned with the recovery of k-sparse vectors a after 
having received only subsampled versions $Aa \in \mathbb{R}^n$. Here, 
$A \in \mathbb{R}^{n \times m}$ is a matrix satisfying CS condition (4), and 
typically in applications n is on the order of $k\log(m/k)$, so that a significant 
dimension reduction takes place (although this is not used in 
our argument). We now state our main theorem precisely. 

Theorem 1 (The ACS Theorem): Fix positive integers n 
and k < m. There are k-sparse $a_1, \ldots, a_N \in \mathbb{R}^m$ with the 
following property: if $A \in \mathbb{R}^{n \times m}$ satisfies (4) and $B \in \mathbb{R}^{n \times m}$ 
and k-sparse $b_1, \ldots, b_N$ are such that (5) holds, then there 
exists an invertible diagonal matrix $D \in \mathbb{R}^{m \times m}$ and a 
permutation matrix $P \in \mathbb{R}^{m \times m}$ such that (6) is satisfied. 

Remark 1: Equation A = BPD and assumption (4) already 
imply the recovery result $b_i = P D a_i$. This follows since 
$A a_i = A D^{-1} P^T b_i$ from (5), and thus $b_i = P D a_i$ from (4). 

Remark 2: In applications, it is important to have bounds 
on the number N of equations of the form (5) guaranteeing 
uniqueness. We address this in Section III. We also note that 
our proof of Theorem 1 gives an algorithm to find P and D. 

Remark 3: It is easy to see that the assumption (4) cannot 
be removed. For instance, trivially, there does not exist such 
a set of vectors $a_1, \ldots, a_N$ when A = 0. 

Theorem 1 is surprising and remarkably general: it says 
that any dictionary learning scheme producing sparse reconstructions 
in a subsampled space automatically gives faithful 
transmission of sparse signals (and dictionary) regardless of 
the CS sampling matrix A, the learned dictionary matrix B, 
or the inferred sparse vectors b(y). 

To better understand some of its complexity, consider the 
statement of Theorem 1 when k < n = m, B = I is the 
identity matrix, and $A \in \mathbb{R}^{n \times n}$ is invertible. In this case, 
the result implies that if Aa is k-sparse for every k-sparse 
a, it must be that A = PD for some permutation matrix 
P and invertible diagonal matrix D. Since every such matrix 
A = PD trivially satisfies this condition, Theorem 1 gives a 
complete characterization of all such matrices. 

Corollary 1: Fix positive integers k < n. The set of 
invertible n x n matrices A having the property that Aa is 
k-sparse for every k-sparse a is the set of matrices PD, where P and 
D run over permutation and invertible diagonal real matrices. 

Remark 4: This corollary is also deducible from [18]. 
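
Corollary 1 can be illustrated on small matrices. A square matrix has the form PD exactly when each row and each column contains a single nonzero entry, which the hypothetical helper `is_scaled_permutation` below tests directly. The second example matrix is invertible but not of this form, and it indeed maps a 1-sparse vector to a 2-sparse one.

```python
def is_scaled_permutation(A):
    """True if A = PD: exactly one nonzero entry in every row and column."""
    n = len(A)
    rows_ok = all(sum(1 for x in row if x != 0) == 1 for row in A)
    cols_ok = all(sum(1 for row in A if row[j] != 0) == 1 for j in range(n))
    return rows_ok and cols_ok

def sparsity(v):
    """Number of nonzero entries of a vector."""
    return sum(1 for x in v if x != 0)

def apply(A, v):
    """Matrix-vector product for A given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A_good = [[0, 2, 0], [0, 0, -1], [3, 0, 0]]  # a permutation times a diagonal
A_bad = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]    # invertible, but not of the form PD

print(is_scaled_permutation(A_good), is_scaled_permutation(A_bad))
print(sparsity(apply(A_bad, [0, 1, 0])))  # image of a 1-sparse vector
```

The examples use k = 1 and n = 3; they illustrate the easy direction and one failure case, not the full characterization.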

A surprising ingredient in our proof of Theorem 1 is a result 
from combinatorial Ramsey theory. We give here one limiting 
instance of the result we use (see Theorem 5 in the Appendix 
for the full statement and Figure 2 for an example with s = 2). 

Theorem 2: Every coloring of the integer points $\mathbb{Z}^2$ in the 
plane with a finite set of colors has the following structural 
property. For each positive integer s, there are subsets 
$H_1, H_2 \subset \mathbb{Z}$ each containing s integers such that all the points 
in the grid $H_1 \times H_2$ possess the same color. 
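
For intuition, the following sketch searches a finite window of the integer grid for a monochromatic subgrid as in Theorem 2. The function name `find_mono_grid`, the window size, and the checkerboard coloring are hypothetical choices for illustration; the theorem itself guarantees existence only on the whole integer grid (or, in its finite form, on sufficiently large windows).

```python
from itertools import combinations

def find_mono_grid(color, size, s):
    """Search {0, ..., size-1}^2 for H1 x H2 with |H1| = |H2| = s, all one color."""
    pts = range(size)
    for H1 in combinations(pts, s):
        for H2 in combinations(pts, s):
            colors = {color(x, y) for x in H1 for y in H2}
            if len(colors) == 1:
                return H1, H2
    return None

def checkerboard(x, y):
    """A 2-coloring of the integer points."""
    return (x + y) % 2

print(find_mono_grid(checkerboard, 4, 2))  # ((0, 2), (0, 2))
print(find_mono_grid(checkerboard, 2, 2))  # None: the window is too small
```

Note that the sets H1 and H2 need not consist of consecutive integers, which is why the checkerboard coloring still contains monochromatic grids.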

Experimental verification that a trained ACS unit robustly 
satisfies both implications (6) of Theorem 1 appears in [12]. 
These findings suggest that a noisy version of Theorem 1 or 
[18, Theorem 3] holds. This will be a focus of future work. 

III. Discussion 

We have shown in this note that any learning scheme that 
converges to a model of predictive coding (1) for a sufficient 
number of samples (2) of compressed data solves the ACS 
Problem uniquely as long as the conditions of compressed 
sensing are met (Theorem 1). For standard SDL, these conditions 
imply that the overcompleteness of a learned dictionary 
can reach but must not exceed $m \approx k\exp(n/(Ck))$. Interestingly, 
our proof of Theorem 1 does not require any assumptions 
about the predictive model. In contrast, previous studies of 
uniqueness of dictionary learning were limited to complete 






dictionaries, e.g. [20], or relied on additional assumptions for 
the predictive coding model. For example, [18] required B 
to fulfill CS condition (4) and put restrictions on the sparse 
codes $b_i$ that could occur. In general, it seems computationally 
challenging to enforce such requirements in a model of predictive 
coding that is dynamically evolving under the learning 
process. 

The number of data samples N required in our proof of 
Theorem 1 has to be very large so that Ramsey theory can 
control the supports of inferred vectors. In the Appendix, we 
derive an upper bound given by inequality (23). In contrast, 
the uniqueness result in [18] requires far fewer samples: 
$N = (k+1)\binom{m}{k}$. It is an open problem how much smaller 
N can be made in Theorem 1 until additional assumptions 
on the predictive model become necessary. We suspect that 
a reduction to Ramsey theory might be necessary for proving 
Theorem 1; thus, without further assumptions, a large N might 
be unavoidable. In regimes of moderate overcompleteness 
or compression, theoretical [20], [21] and experimental [12] 
results suggest that the typical amount of data required for 
successful SDL is much smaller than the N used in our proof. 
One possible explanation for this discrepancy is that SDL 
coding maps b(y) are usually highly structured, whereas the 
code vectors $b_i$ in the hypothesis (5) of Theorem 1 are allowed 
to be arbitrary. 

Another open problem left unaddressed in this work is to 
find conditions under which predictive coding (5) is guaranteed 
to converge. Although widely used in practice, all 
known (non-convex) sparse dictionary learning algorithms lack 
a mathematical proof of convergence, and finding such an 
argument is a major open problem in the field. The most 
significant progress on this problem appears in [22], [20], 
[21], where local convergence of SDL is established.^4 What 
we have accomplished here with Theorem 1 is that whenever 
SDL converges, recovery of high-dimensional sparse codes is 
automatically guaranteed, independent of any assumptions on 
the learned dictionary or sparse code vectors. 

It has to be emphasized that ACS is compressed sensing 
with learning in the decoding stage. Specifically, the decoding 
stage has to infer or learn the product of the compressive 
matrix and the dictionary, both of which are made available 
to the decoder in standard compressed sensing. An earlier 
modification of compressed sensing proposed an altered encoding 
stage [23] (see also [24]). Rather than using a random 
sampling matrix, a learning algorithm called uncertain component 
analysis (UCA) was designed to optimize recovery in 
the decoder. In particular, for data that are not truly sparse 
(such as natural image patches), a learned sampling matrix 
can improve recovery quality considerably. It is an interesting 
question how ACS performs when the data are subsampled 
with a trained matrix rather than a random one. 

Implications for theoretical neuroscience: An important 
consequence of our work is to provide the mathematical 
conditions under which a predictive coding model can be 

4 Roughly, local convergence means that given enough stochastically generated 
samples of the form (5), the learned matrix B and vectors $b_i$ in (5) used 
to reconstruct the data are local minima (with high probability) of a certain 
SDL objective function. See [21] for the most recent result of this kind. 



trained in a compressed space (locally accessible by neurons 
in a receiver region) while accurately coding for data in 
some uncompressed space (neuronal firing in a sender region). 
Predictive coding has been widely proposed to describe representational 
learning in the brain, but it is often criticized 
as unrealistic to assume that the neural code is optimized 
to reproduce the full sensory signal. For instance, recovery 
of the dictionary $\Psi$ for reconstructing the full data $x = \Psi a$ 
from subsamples (2) requires a factorization $A = \Phi\Psi$, which 
is ill-posed without additional constraints [17]. The theory 
of ACS suggests a scheme of representational learning that 
uses predictive coding without assuming that brain regions 
reconstruct full sensory input [12]. The theory predicts that 
learning recovers the sparse causes of the sensory signal, 
which might be useful for classification or other structural 
operations on input 0, 0. Further, the learning is predicted 
to recruit only the number of neurons in the afferent region 
sufficient to represent the input (i.e., see Corollary 2 below). 

ACS theory is consistent with old ideas of efficient coding 
in neuroscience [8], [9]. The difference is that the objective of 
efficient coding is to minimize the redundancy in the neural 
code whereas the objective of ACS is to optimize commu- 
nication through a wiring bottleneck. However, as long as 
the conditions in Problem 1 are met, the appropriate learning 
algorithms and the resulting codes are indistinguishable. For 
example, when trained with randomly subsampled images, 
ACS produces V1-like receptive fields [12]. Interestingly, however, 
these receptive fields are not wired into the circuitry as 
in conventional efficient coding. The receptive fields are only 
reconstructable by an outside observer who has simultaneous 
access to the neural activity in the trained ACS circuit and the 
full images. 

Implications for applications: A practical consequence of 
our main theorem is that it fully characterizes the feasible 
regime of overcompleteness in SDL. Another interesting result 
of this study is that obvious structure in the learned dictionary 
should not be the only criterion of success for SDL. For 
instance, a learned dictionary $B = \Phi\Psi D^{-1} P^T$ might not 
reveal clear structure even though learning has converged and 
the resulting sparse codes correctly represent the underlying 
sparse causes in the original data. It is an important question 
for the future how one can assess if SDL was successful when 
access to uncompressed data is impossible. For example, the 
performance of a classifier might be such a criterion 0, @. 

Finally, it would be interesting to explore whether compress- 
ing data with a random matrix before applying SDL is a form 
of regularization that reduces the number of free parameters 
in the model, thereby increasing the speed of learning. 

IV. Proof of Theorem 1 

Our proof of Theorem 1 involves three main pieces: Theorem 5 
from the Appendix, a combinatorial matrix theory result 
(Proposition 1), and a fact in basic linear algebra (Lemma 1). 
First, Ramsey theory is used to control the supports of $b_i$ 
relative to $a_i$, and then Proposition 1 produces the permutation 
P and diagonal D (inductively) with the help of Lemma 1. 

Before proving Theorem 1 in full generality, let us consider 
the simple case when k = 1. Set $a_i = e_i$ (i = 1, ..., m) to 
be the standard basis column vectors in $\mathbb{R}^m$.^5 Assuming that 
(5) holds for some matrix B and 1-sparse $b_i$, it follows that 

$$A e_i = c_i B e_{\pi(i)}, \quad i = 1, \ldots, m, \tag{7}$$



for some function $\pi : \{1, \ldots, m\} \to \{1, \ldots, m\}$ and $c_i \in \mathbb{R}$. 
Notice that if $c_i = 0$ for some i, then $A e_i = 0$, contradicting 
assumption (4) for A. We show that $\pi$ is necessarily injective 
(and thus is a permutation).^6 Suppose $\pi(i) = \pi(j)$; then, 

$$A e_i = c_i B e_{\pi(i)} = c_i B e_{\pi(j)} = \frac{c_i}{c_j} B c_j e_{\pi(j)} = \frac{c_i}{c_j} A e_j.$$

Again by (4) this is only possible if i = j. Thus, $\pi$ is injective. 
Let P and D denote the permutation and diagonal matrices: 

$$P = \left( e_{\pi(1)} \ \cdots \ e_{\pi(m)} \right), \qquad D = \operatorname{diag}(c_1, \ldots, c_m). \tag{8}$$



The matrix formed by stacking left-to-right the column vectors 
on the right-hand side of (7) is easily seen to satisfy: 

$$\left( B c_1 e_{\pi(1)} \ \cdots \ B c_m e_{\pi(m)} \right) = BPD.$$

On the other hand, the columns $A e_i$ form the matrix A. Taken 
together, therefore, equations (7) imply that A = BPD. As 
discussed in Remark 1, the remainder of (6) follows directly 
from this matrix identity and (4). 
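
The construction in the case k = 1 is easy to reproduce numerically. The following sketch assembles P and D from a permutation and nonzero scalars and forms the product BPD; the names `build_P_D` and `matmul`, the 0-indexed encoding of the permutation, and the example data are all hypothetical.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def build_P_D(pi, c):
    """Return P with i-th column e_{pi(i)} and D = diag(c), using 0-indexed pi."""
    m = len(pi)
    P = [[1 if pi[i] == r else 0 for i in range(m)] for r in range(m)]
    D = [[c[i] if i == j else 0 for j in range(m)] for i in range(m)]
    return P, D

# Hypothetical data: A e_i = c_i * B e_{pi(i)} for i = 0, 1, 2.
B = [[1, 0, 1], [0, 1, 1]]
pi = [2, 0, 1]
c = [2, -1, 3]

P, D = build_P_D(pi, c)
A = matmul(matmul(B, P), D)
print(A)  # [[2, -1, 0], [2, 0, 3]]
```

Column i of the result equals c_i times column pi(i) of B, which is exactly the stacked form of the per-column relations.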

This finishes the proof of Theorem 1 with sparsity k = 1. 
Unfortunately, the proof for a general k is considerably more 
difficult. The main trouble is that for k > 1, it is nontrivial 
to produce P and D as in (8) using only assumptions (4) and 
(5); this is where methods from combinatorics are needed. 

In what follows, we will use the notation [m] for the set 
{1, ..., m}, and $\binom{[m]}{k}$ for the set of k-element subsets of [m]. 
Also, recall that $\operatorname{Span}\{v_1, \ldots, v_\ell\}$ for real vectors $v_1, \ldots, v_\ell$ 
is the vector space consisting of their $\mathbb{R}$-linear span: 

$$\operatorname{Span}\{v_1, \ldots, v_\ell\} = \left\{ \sum_{i=1}^{\ell} t_i v_i : t_1, \ldots, t_\ell \in \mathbb{R} \right\}.$$

Finally, for a set $S \subseteq [m]$ and a matrix A with columns 
$\{A_1, \ldots, A_m\}$, we define 

$$\operatorname{Span}\{A_S\} = \operatorname{Span}\{A_s : s \in S\}.$$

A main ingredient in the proof of Theorem 1 is the following 
fact in combinatorial matrix theory. For a concrete instance of 
this argument when k = 2, see Example 1 below. 

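Proposition 1: Fix positive integers n and k < m. Let 
$A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{n \times m}$ have columns $\{A_1, \ldots, A_m\}$ 
and $\{B_1, \ldots, B_m\}$. Suppose that A satisfies (4) and that 

$$\alpha : \binom{[m]}{k} \to \binom{[m]}{k} \tag{9}$$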

5 The column vector $e_i$ has a 1 in its ith coordinate and zeroes elsewhere. 

6 Injectivity for a function $\pi : S \to T$ from a set S to a set T means that 
$\pi(s_1) = \pi(s_2)$ if and only if $s_1 = s_2$. The function $\pi$ is bijective if in 
addition to being injective it is also surjective; that is, for each $t \in T$ there 
exists an $s \in S$ such that $\pi(s) = t$. A bijective function $\pi$ has an inverse 
$\pi^{-1} : T \to S$ which satisfies $\pi(\pi^{-1}(t)) = t$ and $\pi^{-1}(\pi(s)) = s$ for all 
$s \in S$ and $t \in T$. When S = T is a finite set, an injective function is also 
bijective; in this case, the function $\pi$ is called a permutation of the set S. 



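is a map with the following property: for all $S \in \binom{[m]}{k}$, 

$$\operatorname{Span}\{A_S\} = \operatorname{Span}\{B_{\alpha(S)}\}. \tag{10}$$

Then, there exists a permutation matrix $P \in \mathbb{R}^{m \times m}$ and an 
invertible diagonal matrix $D \in \mathbb{R}^{m \times m}$ such that A = BPD. 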
Proof: We shall induct on k, the base case k = 1 having 
already been worked out at the beginning of this section. We 
first prove that $\alpha$ is injective (and thus bijective). Suppose that 
$S_1, S_2 \in \binom{[m]}{k}$ are such that $\alpha(S_1) = \alpha(S_2)$; then by (10), 

$$\operatorname{Span}\{A_{S_1}\} = \operatorname{Span}\{B_{\alpha(S_1)}\} = \operatorname{Span}\{B_{\alpha(S_2)}\} = \operatorname{Span}\{A_{S_2}\}.$$

In particular, using (13) from Lemma 1 below with $\ell = k$ 
and M = A, it follows that $S_1 = S_2$ and thus $\alpha$ is bijective. 
Moreover, from this bijectivity of $\alpha$ and the fact that every k 
columns of A are linearly independent, it follows that every 
k columns of B are also linearly independent. 

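We complete the proof, inductively, by producing a map: 

$$\tau : \binom{[m]}{k-1} \to \binom{[m]}{k-1} \tag{11}$$

which satisfies 

$$\operatorname{Span}\{A_S\} = \operatorname{Span}\{B_{\tau(S)}\} \quad \text{for } S \in \binom{[m]}{k-1}.$$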
Let $\beta = \alpha^{-1}$ denote the inverse of $\alpha$. Fix 
$S = \{i_1, \ldots, i_{k-1}\} \in \binom{[m]}{k-1}$, and set $S_1 = S \cup \{r\}$ and 
$S_2 = S \cup \{s\}$ for some fixed $r, s \notin S$ with $r \neq s$ (so that 
$\beta(S_1) \neq \beta(S_2)$ by injectivity of $\beta$).^7 Intersecting equations 
(10) with $S = \beta(S_1)$ and $S = \beta(S_2)$ and then applying 
identity (14) with M = A, it follows that 

$$\operatorname{Span}\{B_S, B_r\} \cap \operatorname{Span}\{B_S, B_s\} = \operatorname{Span}\{A_{\beta(S_1) \cap \beta(S_2)}\}. \tag{12}$$

Since the left-hand side of (12) is at least (k - 1)-dimensional,^8 
the number of elements in the set $\beta(S_1) \cap \beta(S_2)$ is either k - 1 
or k. But $\beta(S_1) \neq \beta(S_2)$, so that $\beta(S_1) \cap \beta(S_2)$ consists of 
k - 1 elements. Moreover, 

$$\operatorname{Span}\{B_S\} \subseteq \operatorname{Span}\{A_{\beta(S_1) \cap \beta(S_2)}\}$$

implies that $\operatorname{Span}\{B_S\} = \operatorname{Span}\{A_{\beta(S_1) \cap \beta(S_2)}\}$.^9 

The association $S \mapsto \beta(S_1) \cap \beta(S_2)$ discussed above 
defines a function $\gamma : \binom{[m]}{k-1} \to \binom{[m]}{k-1}$ with the property 
that $\operatorname{Span}\{B_S\} = \operatorname{Span}\{A_{\gamma(S)}\}$. Finally, we show that $\gamma$ is 
injective, which implies that $\tau = \gamma^{-1}$ is the map desired in 
(11) for the induction. If $\gamma(S) = \gamma(S')$, then $\operatorname{Span}\{B_S\} = 
\operatorname{Span}\{B_{S'}\}$. By (13) in Lemma 1 with $\ell = k - 1$ and M = B, 
we have S = S'. Thus, $\gamma$ is injective, finishing the proof. ■ 

The following elementary fact in linear algebra was used to 
prove Proposition 1 above. 

Lemma 1: Let $M \in \mathbb{R}^{n \times m}$. If every set of $\ell + 1$ columns 
of M is linearly independent, then for $S, S' \in \binom{[m]}{\ell}$, 

$$\operatorname{Span}\{M_S\} = \operatorname{Span}\{M_{S'}\} \implies S = S'. \tag{13}$$



7 Here we use the assumption that k < m so that such a pair $r \neq s$ exists. 

8 Recall that we showed that every k columns of B are linearly independent. 

9 Use the following basic fact of linear algebra. If $U \subseteq V$ are two subspaces 
of a vector space W such that dim(U) = dim(V) (i.e., they have the same 
vector-space dimension), then U = V. 






If M satisfies condition (4) and $S_1, S_2 \in \binom{[m]}{k}$, then 

$$\operatorname{Span}\{M_{S_1 \cap S_2}\} = \operatorname{Span}\{M_{S_1}\} \cap \operatorname{Span}\{M_{S_2}\}. \tag{14}$$

Proof: To prove statement (13), suppose by way of 
contradiction that $S \neq S' \in \binom{[m]}{\ell}$ are such that $\operatorname{Span}\{M_S\} = 
\operatorname{Span}\{M_{S'}\}$. Then, without loss of generality, there is an $i \in S$ 
with $i \notin S'$. But $M_i \in \operatorname{Span}\{M_{S'}\}$, which would imply that 
the $\ell + 1$ columns of M determined by $S' \cup \{i\}$ are not linearly 
independent, a contradiction to the assumption on M. 

We now prove (14). The inclusion $\subseteq$ in (14) is trivial, so 
suppose $y \in \operatorname{Span}\{M_{S_1}\} \cap \operatorname{Span}\{M_{S_2}\}$. Express y as a linear 
combination of k columns of M indexed by $S_1$ and, separately, 
as a combination of k columns of M indexed by $S_2$. By 
assumption (4), these linear combinations must be identical. In 
particular, y was expressed as a linear combination of columns 
of M indexed by $S_1 \cap S_2$, and thus is in $\operatorname{Span}\{M_{S_1 \cap S_2}\}$. ■ 
Example 1: We give a simple example of how the proof of 
Proposition 1 works in the case n = m = 3, k = 2. Suppose 
that $\alpha : \binom{[3]}{2} \to \binom{[3]}{2}$ is the (necessarily bijective) map: 

$$\alpha(\{1,2\}) = \{2,3\}, \quad \alpha(\{1,3\}) = \{1,2\}, \quad \alpha(\{2,3\}) = \{1,3\}.$$

Following the proof of Proposition 1, one can check that 

$$\gamma(\{1\}) = \{3\}, \quad \gamma(\{2\}) = \{1\}, \quad \gamma(\{3\}) = \{2\},$$

and thus we obtain the function $\tau = \gamma^{-1}$ as desired in (11). 
The resulting permutation P represents the cycle (123). 
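
The computation in Example 1 can be replayed mechanically. The sketch below encodes the map on 2-element subsets as a dictionary of frozensets, inverts it, and forms the intersections used in the proof of Proposition 1; the variable names are ours, and the code is specific to this small case where exactly two indices lie outside each singleton S.

```python
from itertools import combinations

m, k = 3, 2
alpha = {
    frozenset({1, 2}): frozenset({2, 3}),
    frozenset({1, 3}): frozenset({1, 2}),
    frozenset({2, 3}): frozenset({1, 3}),
}
beta = {image: S for S, image in alpha.items()}  # inverse of alpha

def gamma(S):
    """Intersect beta(S u {r}) and beta(S u {s}) for two indices r != s outside S."""
    r, s = sorted(set(range(1, m + 1)) - S)[:2]
    return beta[S | {r}] & beta[S | {s}]

subsets = map(frozenset, combinations(range(1, m + 1), k - 1))
gamma_map = {S: gamma(S) for S in subsets}
tau = {image: S for S, image in gamma_map.items()}  # inverse of gamma

for S in sorted(gamma_map, key=sorted):
    print(sorted(S), "->", sorted(gamma_map[S]))
```

The printed assignments agree with the values of the map on singletons stated in Example 1, and inverting them gives the map required for the induction step.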

We now prove Theorem 1 by combining Theorem 5 from 
the Appendix with Proposition 1. 

Proof of Theorem 1: It is sufficient to construct a map 

$$\alpha : \binom{[m]}{k} \to \binom{[m]}{k}, \qquad \{i_1, \ldots, i_k\} \mapsto \{r_1, \ldots, r_k\} \tag{15}$$

satisfying hypothesis (10) in Proposition 1. 

Fix a finite subset $T \subset \mathbb{R}$. Also, fix $\{i_1, \ldots, i_k\} \subseteq [m]$, and let $a = t_1 e_{i_1} + \cdots + t_k e_{i_k}$ for $t_i \in T$. Suppose that $b$ is $k$-sparse with $Aa = Bb$; then, $b \in \mathrm{Span}\{e_{j_1}, \ldots, e_{j_k}\}$ for some $\{j_1, \ldots, j_k\} \in \binom{[m]}{k}$. Viewing each $k$-element subset of $[m]$ as a color, this map

$$f: (t_1, \ldots, t_k) \mapsto \{j_1, \ldots, j_k\}$$

is a coloring of the finite set $T^k$ with colors in $C = \binom{[m]}{k}$.

We now apply Ramsey theory. For sufficiently large sets $T$, Theorem 5 below guarantees that regardless of the map $f$ there are subsets $H_1, \ldots, H_k \subseteq T$ each having $s = 2$ elements and $\{r_1, \ldots, r_k\} \in C$ such that $f(t_1, \ldots, t_k) = \{r_1, \ldots, r_k\}$ is constant for all $(t_1, \ldots, t_k) \in H_1 \times \cdots \times H_k$.

We claim that defining $\alpha(\{i_1, \ldots, i_k\}) = \{r_1, \ldots, r_k\}$ according to the above recipe gives a map satisfying (10). To verify the claim, we shall prove that

$$\mathrm{Span}\{Ae_{i_1}, \ldots, Ae_{i_k}\} \subseteq \mathrm{Span}\{Be_{r_1}, \ldots, Be_{r_k}\}, \tag{16}$$

which is inclusion $\subseteq$ of (10). To see how the other inclusion follows from this, observe that the left-hand vector space in (16) is $k$-dimensional, while the right-hand space is at most $k$-dimensional; thus, equality holds.

To verify (16), we need to show that each $Ae_{i_\ell}$ is in the right-hand span. To see this, consider a pair of elements in $H_1 \times \cdots \times H_k \subset \mathbb{R}^k$ which differ only in the $\ell$th coordinate. By construction, the vectors $Aa_1$, $Aa_2$ corresponding to these two points have difference $A(a_1 - a_2)$ that is a nonzero scalar multiple of $Ae_{i_\ell}$ and is also in the right-hand span of (16).
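The difference step in this argument is easy to see numerically. The following is a minimal sketch under assumed data: the $2 \times 4$ integer matrix `A`, the support, and the two-element sets `H` are hypothetical values chosen only for illustration, and indices are zero-based.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def sparse_vector(m, support, coeffs):
    """Vector in R^m with the given coefficients on the given support."""
    v = [0] * m
    for index, t in zip(support, coeffs):
        v[index] = t
    return v

# Hypothetical data (not from the paper).
A = [[1, 0, 2, 1],
     [0, 1, 1, 3]]
m = 4
support = (0, 2)          # plays the role of {i_1, i_2}
H = [(1, 3), (2, 5)]      # H_1 and H_2, each with s = 2 elements
ell = 1                   # the coordinate in which the two grid points differ

p1 = (H[0][0], H[1][0])   # grid point (1, 2)
p2 = (H[0][0], H[1][1])   # grid point (1, 5), differing from p1 only in coordinate ell
a1 = sparse_vector(m, support, p1)
a2 = sparse_vector(m, support, p2)

# A a_1 - A a_2 = A (a_1 - a_2)
difference = [u - v for u, v in zip(matvec(A, a1), matvec(A, a2))]
column = [row[support[ell]] for row in A]   # the column A e_{i_ell}
scale = p1[ell] - p2[ell]                   # nonzero because the two points differ
```

Here `difference` equals `scale` times `column`, which is the "nonzero scalar multiple" used in the proof.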



We close this section by stating a generalization of Theorem 1 that allows for arbitrary receiver dimensions $p \geq m$. We omit the similar proof. For this generalization, we define a column permutation matrix to be a binary matrix having exactly one 1 in each column and at most one 1 in each row.

Theorem 3 (Overcomplete ACS Theorem): Fix positive integers $n$ and $k < m \leq p$. There are $k$-sparse $a_1, \ldots, a_N \in \mathbb{R}^m$ with the following property: if $A \in \mathbb{R}^{n \times m}$ satisfies the spark condition, and $B \in \mathbb{R}^{n \times p}$ and $k$-sparse $b_1, \ldots, b_N \in \mathbb{R}^p$ are such that $Aa_i = Bb_i$ for all $i = 1, \ldots, N$, then there is an invertible diagonal matrix $D \in \mathbb{R}^{m \times m}$ and a column permutation matrix $P \in \mathbb{R}^{p \times m}$ such that $A = BPD$.

Moreover, if the coding map $b(y)$ satisfies $b(ta) \in \mathrm{Span}\{b(a)\}$ for $t \in \mathbb{R}$ and 1-sparse $a$, then $b_i = PDa_i$ and $PP^{T}b_i = b_i$ for all $i = 1, \ldots, N$.

Corollary 2 (ACS Efficiency): Under the assumptions of Theorem 3, there is a fixed set of $p - m$ coordinates of inferred sparse vectors $b_i$ which are always zero.

Proof: Let $P$ be the column permutation matrix from the conclusion of Theorem 3. The number of rows of $P$ that consist of all zeroes is $p - m$ since there are only $m$ entries of $P$ equal to 1. But then left multiplication by the matrix $P$ on vectors $P^{T}b_i$ produces vectors with a fixed set of at least $p - m$ coordinates that are zero. ■
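The counting in this proof can be illustrated on a small column permutation matrix. The sketch below assumes $p = 5$, $m = 3$, a particular placement of the ones, and a test vector `b`; all of these, and the helper names, are hypothetical.

```python
def column_permutation(p, rows_hit):
    """p x m binary matrix whose column j has its single 1 in row rows_hit[j]."""
    m = len(rows_hit)
    if len(set(rows_hit)) != m or m > p:
        raise ValueError("each row may contain at most one 1")
    return [[1 if rows_hit[j] == i else 0 for j in range(m)] for i in range(p)]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

p, m = 5, 3
P = column_permutation(p, rows_hit=[4, 0, 2])

# Rows of P that are entirely zero: there are p - m of them.
zero_rows = [i for i, row in enumerate(P) if not any(row)]

# Left multiplication by P zeroes out those coordinates, whatever the input.
b = [7, -1, 4, 9, 2]                            # hypothetical vector in R^p
projected = matvec(P, matvec(transpose(P), b))  # P P^T b
```

The coordinates listed in `zero_rows` vanish in `projected` for every choice of `b`, which is the fixed zero set of the corollary.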

As an application of Theorem 3, consider its interpretation in the context of sender and receiver populations of neurons. Suppose that sparse dictionary learning has been successfully applied in a receiver region to decode a sender's signals. Then, Theorem 3 says that the receiver will recover the original codes up to natural equivalences, while Corollary 2 predicts that any extra neurons will become decoupled from the input stream and thus be free to be utilized for other purposes.

Appendix 
Ramsey Theory 

We now explain how to prove Theorem 5 below; it was a crucial ingredient in the proof of Theorem 1. Its statement is very similar to a basic result in Ramsey theory [19, Theorem A]. For a recent survey of the field of Ramsey theory, see the article [25], and for a compilation of several applications to computer science and mathematics, see the paper [26].

Given positive integers $c$ and $s_1, \ldots, s_c$, the Ramsey number $R(s_1, \ldots, s_c)$ is defined as the least integer $R$ (if it exists) such that if the edges of the complete graph $K_R$ on $R$ vertices are colored with $c$ colors, there is an $i \in [c]$ and a subset of $s_i$ vertices of $K_R$ all of whose pair-wise edges are the same color $i$. Ramsey's Theorem is then the statement that a finite $R(s_1, \ldots, s_c)$ always exists. For instance, as pointed out near the end of the introduction, we have $R(3, 3) = 6$. To prove Theorem 1, however, we need the following modification of Ramsey's result.
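The value $R(3, 3) = 6$ is small enough to confirm by exhaustive search. The sketch below is a brute-force check and not an efficient method: it enumerates all 2-colorings of the edges of $K_n$ and looks for a monochromatic triangle. The function names are illustrative.

```python
from itertools import combinations, product

def has_mono_triangle(n, color):
    """color maps each edge (i, j) with i < j of K_n to a color."""
    return any(
        color[(a, b)] == color[(a, c)] == color[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_has_mono_triangle(n):
    """True if every 2-coloring of the edges of K_n contains a monochromatic triangle."""
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(n, dict(zip(edges, bits)))
        for bits in product((0, 1), repeat=len(edges))
    )

forced_on_6 = every_coloring_has_mono_triangle(6)   # all 2**15 colorings of K_6
forced_on_5 = every_coloring_has_mono_triangle(5)   # fails, e.g. for the pentagon coloring
```

Six vertices force a monochromatic triangle while five do not, which is exactly the statement $R(3, 3) = 6$.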

Theorem 4 (Infinite version): Fix a finite set $C$ of $c$ colors and positive integers $k, s$. For all sufficiently large sets $T_1, \ldots, T_k$, every coloring

$$f: T_1 \times \cdots \times T_k \to C$$

of the points in $T_1 \times \cdots \times T_k$ obeys the following structural property: there are subsets $H_i \subseteq T_i$ of size $s$ such that all points in $H_1 \times \cdots \times H_k$ have the same color.

Fig. 2. Iterated pigeon-hole principle. How to find a $2 \times 2$ submatrix of the same color in a 2-coloring of a $3 \times 7$ grid (this is the $s = 2$, $c = 2$, $k = 2$ case of Theorem 5). In the top row of the figure above, there are 4 entries with the same color blue (they have black borders). Of the 4 entries directly below the blue from the top row, there are 3 of them that are colored white. Finally, below these lie 2 that are blue. The entries comprising the resulting $2 \times 2$ submatrix with the same color are shown with a black dot in their centers. Our example also shows that such a $2 \times 2$ submatrix need not exist in a $3 \times 6$ grid. Note that a direct application of Ramsey's theorem, as suggested by Remark 5, requires a grid of size $18 \times 18$ since $R(4, 4) = 18$. To give the reader some sense of the complexity of these questions, we remark that the number $R(5, 5)$ is not known although it is between 43 and 49 [28].

Remark 5: To see how Theorem 4 is related to Ramsey's theorem, consider when $k = 2$ and $c = 2$ (see Figure 2). In this case, one easily verifies that if $|T_1|, |T_2| \geq R(2s, 2s)$, there are subsets $H_1, H_2$ of size $s$ as in the theorem statement.

Figure 2 shows how "sufficiently large" in Theorem 4 may be taken to be $|T_1| \geq 3$ and $|T_2| \geq 7$ in the case $c = k = s = 2$. Thus, defining $T(c, k, s)$ to be the smallest such product $|T_1| \cdots |T_k|$ among sizes of sets $T_i$ guaranteeing the conclusion of Theorem 5, we have $T(2, 2, 2) \leq 21$. More generally, the number $T(c, k, 2)$ will be important when we address below the question posed in Remark 2.
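Both grid claims (a monochromatic $2 \times 2$ submatrix is forced in a $3 \times 7$ grid but not in a $3 \times 6$ grid) can be confirmed by a short exhaustive search. This is a sketch with illustrative function names. It relies on the observation that reordering columns does not affect the property, so it is enough to enumerate multisets of columns rather than all colorings.

```python
from itertools import combinations, combinations_with_replacement, product

def has_mono_2x2(columns):
    """columns: sequence of columns, each a tuple holding one color per row."""
    rows = range(len(columns[0]))
    for col_a, col_b in combinations(columns, 2):
        for r1, r2 in combinations(rows, 2):
            if col_a[r1] == col_a[r2] == col_b[r1] == col_b[r2]:
                return True
    return False

def every_coloring_has_mono_2x2(n_rows, n_cols, n_colors=2):
    """True if every coloring of the n_rows x n_cols grid has a monochromatic 2 x 2 submatrix."""
    column_types = list(product(range(n_colors), repeat=n_rows))
    # Column order is irrelevant, so multisets of column types cover all colorings.
    return all(
        has_mono_2x2(cols)
        for cols in combinations_with_replacement(column_types, n_cols)
    )

forced_3x7 = every_coloring_has_mono_2x2(3, 7)
forced_3x6 = every_coloring_has_mono_2x2(3, 6)

# One 3 x 6 coloring with no monochromatic 2 x 2 submatrix: the six non-constant columns.
witness = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
```

The witness works because each of its columns repeats a color in exactly one pair of rows, and no two columns repeat the same color in the same pair.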

As Melody Chan points out [27], Theorem 4 can be deduced in the same manner as the standard inductive proof of Ramsey's Theorem. However, to best answer Remark 2, we desire more effective bounds than those offered by this argument.

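Given positive integers $k$, $s$, and $c$, define numbers $d_0, d_1, \ldots, d_k$ recursively as follows:

$$d_i = s \cdot c^{2 d_{i-1}}, \quad i = 1, \ldots, k; \qquad d_0 = \tfrac{1}{2}. \tag{17}$$

Notice that these (towers of) numbers grow very rapidly:

$$d_1 = sc, \qquad d_2 = s \cdot c^{2sc}, \qquad \text{and} \qquad d_3 = s \cdot c^{2sc^{2sc}}.$$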
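To get a feel for the growth, the first few values can be computed directly. The sketch below assumes the recursion reads $d_0 = 1/2$ and $d_i = s \cdot c^{2d_{i-1}}$, which is a reconstruction consistent with $d_1 = sc$ and with the bound used later in the proof of Theorem 5. The function name `tower` is illustrative.

```python
from fractions import Fraction

def tower(c, s, k):
    """Return [d_1, ..., d_k], assuming d_0 = 1/2 and d_i = s * c**(2 * d_{i-1})."""
    d = Fraction(1, 2)
    values = []
    for _ in range(k):
        exponent = int(2 * d)   # equals 1 in the first step, then 2 * d_{i-1}
        d = s * c ** exponent
        values.append(d)
    return values

small = tower(c=2, s=2, k=2)           # d_1 and d_2 for two colors
d3_bits = tower(2, 2, 3)[2].bit_length()   # size of d_3 in binary digits
```

Even for $c = s = 2$, the third value already has more than a thousand binary digits.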



A version of the following fact is likely known, but we include 
a proof for completeness. 

Theorem 5 (Effective version): Fix a finite set $C$ of $c$ colors and positive integers $k, s$. If $T_1, \ldots, T_k$ are finite sets with sizes $|T_i| \geq d_i$, then for every coloring

$$f: T_1 \times \cdots \times T_k \to C$$

of the points in $T_1 \times \cdots \times T_k$, there are $H_i \subseteq T_i$ of size $s$ such that all points in $H_1 \times \cdots \times H_k$ have the same color.

Proof: First note that it is enough to prove the theorem for set sizes $|T_i| = d_i$ (by removing points as necessary). When $k = 1$, the proof boils down to the pigeon-hole principle: since the range of $f$ is finite of size $c$ and since $|T_1| = d_1 = sc$, there must be at least $s = d_1 / c$ points in $T_1$ which map to the same element of $C$. For expositional clarity, we only sketch the details of the proof when $k = 3$, as the ideas are the same in general. Since $c = 1$ is trivial, we shall assume $c > 1$. Enumerate the elements of the three given sets as:

$$T_1 = \{t_{11}, \ldots, t_{1d_1}\}, \quad T_2 = \{t_{21}, \ldots, t_{2d_2}\}, \quad T_3 = \{t_{31}, \ldots, t_{3d_3}\}. \tag{18}$$



Consider the tree of height 2 with root $t_{11}$ and children $t_{21}, \ldots, t_{2d_2}$, each of which has children $t_{31}, \ldots, t_{3d_3}$. Since the number of children of $t_{21}$ is $d_3$, there is a subset $T_3' \subseteq T_3$ with $d_3 / c$ elements such that

$$f(t_{11}, t_{21}, T_3') := \{f(t_{11}, t_{21}, t') : t' \in T_3'\} \subseteq C$$

is a single color in $C$. Consider now those children of $t_{22}$ which are also members of $T_3'$. As before, there is a subset $T_3'' \subseteq T_3'$ of size $d_3 / c^2$ satisfying $|f(t_{11}, t_{22}, T_3'')| = 1$. Continuing in this manner a total of $d_2$ times, one produces a subset $T_3^{(1)} \subseteq T_3$ of size $d_3 / c^{d_2}$ with the property that $|f(t_{11}, t_{2i}, T_3^{(1)})| = 1$ for all $i = 1, \ldots, d_2$. Examine now the images $f(t_{11}, t_{2i}, T_3^{(1)})$ as $i$ varies. As is easily seen, there is a subset $T_2^{(1)} \subseteq T_2$ of size $d_2 / c$ such that $f(t_{11}, T_2^{(1)}, T_3^{(1)})$ consists of only 1 element.

The procedure in the previous paragraph constitutes the $j = 1$st round in a process of $d_1$ rounds that will produce the desired subsets $H_1, H_2, H_3$. Set $T_i^{(0)} = T_i$ for each $i$. To move generally from round $j$ to round $j + 1$, the idea is to consider the same tree as before but with root $t_{1,j+1}$ having children $T_2^{(j)}$ (each of which has children $T_3^{(j)}$), and to produce new subsets $T_3^{(j+1)} \subseteq T_3^{(j)}$ and $T_2^{(j+1)} \subseteq T_2^{(j)}$ with the same procedure as above. At the end of $d_1$ such rounds, we will have produced nested subsets

$$T_3^{(d_1)} \subseteq \cdots \subseteq T_3^{(1)} \subseteq T_3^{(0)} \quad \text{and} \quad T_2^{(d_1)} \subseteq \cdots \subseteq T_2^{(1)} \subseteq T_2^{(0)} \tag{19}$$

such that

$$|f(t_{1j}, T_2^{(j)}, T_3^{(j)})| = 1, \quad \text{for each } j = 1, \ldots, d_1. \tag{20}$$



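It is easy to check that after $j$ such rounds of culling,

$$|T_2^{(j)}| = \frac{|T_2^{(j-1)}|}{c} \quad \text{and} \quad |T_3^{(j)}| = \frac{|T_3^{(j-1)}|}{c^{|T_2^{(j-1)}|}}. \tag{21}$$

A straightforward induction shows that eqs. (21) have solution

$$|T_2^{(j)}| = \frac{d_2}{c^{j}} \quad \text{and} \quad |T_3^{(j)}| = \frac{d_3}{c^{\,d_2 \sum_{i=0}^{j-1} c^{-i}}}. \tag{22}$$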



Set $H_2 = T_2^{(d_1)}$ and $H_3 = T_3^{(d_1)}$. Since $d_1 = sc$, we know from (19) and (20) that there is a subset $H_1 \subseteq T_1$ of $s$ elements such that $f(H_1, H_2, H_3)$ consists of only one color from $C$.

We are done, therefore, as long as $H_2$, $H_3$ have size at least $s$. For $H_2$ this is clear, while for $H_3$ we compute using (22):

$$|H_3| = |T_3^{(d_1)}| \geq \frac{d_3}{c^{2 d_2}} = s,$$

where the inequality holds because $\sum_{i=0}^{d_1 - 1} c^{-i} < 2$ when $c \geq 2$.
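The culling idea is easiest to see for $k = 2$. The sketch below is a simplified illustration of the iterated pigeon-hole principle and not the authors' argument for $k = 3$: it visits each row once, keeps the largest color class among the surviving columns, and then picks $s$ rows that recorded the same color. The grid size $4 \times 32$ and the coloring `f` are hypothetical choices, made so that $32 / 2^4 = 2$ columns are guaranteed to survive.

```python
from collections import defaultdict

def mono_subgrid(rows, cols, f, s):
    """Search for H1 (rows) and H2 (cols), each of size s, with f constant on H1 x H2.

    Returns None when too few rows or columns survive the culling.
    Assumes rows and cols are non-empty.
    """
    surviving = list(cols)
    row_color = {}
    for r in rows:
        classes = defaultdict(list)
        for col in surviving:
            classes[f(r, col)].append(col)
        # Pigeon-hole step: the largest class keeps at least a 1/c fraction of the columns.
        color, surviving = max(classes.items(), key=lambda item: len(item[1]))
        row_color[r] = color
    by_color = defaultdict(list)
    for r, color in row_color.items():
        by_color[color].append(r)
    # Pigeon-hole step on the rows: some color was recorded by at least len(rows) / c rows.
    best_rows = max(by_color.values(), key=len)
    if len(best_rows) < s or len(surviving) < s:
        return None
    return best_rows[:s], surviving[:s]

# Hypothetical coloring with c = 2 colors: bit i of the column index j.
def f(i, j):
    return (j >> i) & 1

result = mono_subgrid(range(4), range(32), f, s=2)
```

Every surviving column has the recorded color in every row, so any $s$ rows sharing a recorded color, together with any $s$ surviving columns, form a monochromatic subgrid.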



Remark 6: Clearly, the tower of numbers $d_i$ defined by (17) can be decreased in Theorem 5 (see Figure 2), but we do not know by how much.

Corollary 3: We have the following bound:

$$T(c, k, s) \leq \prod_{i=1}^{k} d_i.$$

In the application of Theorem 5 for Theorem 1, we have $c = \binom{m}{k}$ and $s = 2$. Thus, the following is an upper bound on the number $N$ of $k$-sparse vectors $a_i$ used in our proof of Theorem 1:

$$N \leq \binom{m}{k} \, T\!\left(\binom{m}{k}, k, 2\right). \tag{23}$$



As a final remark, we note that the computational decision problem associated with finding $H_1 \times \cdots \times H_k$ all of the same color as in Theorem 5 is NP-complete when $c > 1$ (reduce to BICLIQUE and then apply the results of [29]).

Acknowledgment 

The authors would like to thank the following people for helpful discussions: Charles Cadieu, Jack Culpepper, Mike DeWeese, Guy Isely, Amir Khosrowshahi, and Chris Rozell. We also thank Melody Chan for discussions on Ramsey theory and Matthias Mnich for explaining the NP-completeness lurking inside of Theorem 5.



References

[1] P. Comon, "Independent component analysis, a new concept?" Signal processing, vol. 36, no. 3, pp. 287-314, 1994.

[2] A. Bell and T. Sejnowski, "Learning the higher-order structure of a natural sound," Network: Computation in Neural Systems, vol. 7, no. 2, pp. 261-266, 1996.

[3] B. Olshausen and D. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607-609, 1996.

[4] E. Smith and M. Lewicki, "Efficient auditory coding," Nature, vol. 439, no. 7079, pp. 978-982, 2006.

[5] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[6] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning mid-level features for recognition," Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[7] G. Hinton, "Connectionist learning procedures," Artificial intelligence, vol. 40, no. 1-3, pp. 185-234, 1989.

[8] F. Attneave, "Informational aspects of visual perception," Psychol. Rev., vol. 61, pp. 183-93, 1954.

[9] H. B. Barlow, "Possible principles underlying the transformation of sensory messages," 1961.

[10] S. Chen, D. Donoho, and M. Saunders, "Atomic decomposition by basis pursuit," SIAM review, vol. 43, p. 129, 2001.

[11] W. Coulter, C. Hillar, G. Isley, and F. Sommer, "Adaptive compressed sensing - A new class of self-organizing coding models for neuroscience," in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, pp. 5494-5497.

[12] G. Isely, C. Hillar, and F. Sommer, "Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication," Advances in neural information processing systems, 2010.

[13] D. Donoho, "Compressed sensing," Information Theory, IEEE Transactions on, vol. 52, no. 4, pp. 1289-1306, 2006.

[14] E. Candes and T. Tao, "Decoding by linear programming," Information Theory, IEEE Transactions on, vol. 51, no. 12, pp. 4203-4215, 2005.

[15] J. D. Blanchard, C. Cartis, and J. Tanner, "Compressed sensing: How sharp is the restricted isometry property?" SIAM Review, vol. 53, no. 1, pp. 105-125, 2011. [Online]. Available: http://link.aip.org/link/?SIR/53/105/1

[16] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, "A simple proof of the restricted isometry property for random matrices," Constructive Approximation, vol. 28, no. 3, pp. 253-263, 2008.

[17] S. Gleichman and Y. Eldar, "Blind compressed sensing," Information Theory, IEEE Transactions on, vol. 57, no. 10, pp. 6958-6975, 2011.

[18] M. Aharon, M. Elad, and A. Bruckstein, "On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them," Linear algebra and its applications, vol. 416, no. 1, pp. 48-67, 2006.

[19] F. P. Ramsey, "On a problem of formal logic," Proceedings of the London Mathematical Society, vol. s2-30, no. 1, pp. 264-286, 1930. [Online]. Available: http://plms.oxfordjournals.org/content/s2-30/1/264.short

[20] R. Gribonval and K. Schnass, "Dictionary Identification-Sparse Matrix-Factorization via $\ell_1$-Minimization," Information Theory, IEEE Transactions on, vol. 56, no. 7, pp. 3523-3539, 2010.

[21] Q. Geng, H. Wang, and J. Wright, "On the local correctness of $\ell_1$-minimization for dictionary learning," CoRR, vol. abs/1101.5672, 2011.

[22] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," The Journal of Machine Learning Research, vol. 11, pp. 19-60, 2010.

[23] Y. Weiss, H. Chang, and W. Freeman, "Learning compressed sensing," in Snowbird Learning Workshop, Allerton, CA. Citeseer, 2007.

[24] V. Abolghasemi, D. Jarchi, and S. Sanei, "A robust approach for optimization of the measurement matrix in compressed sensing," in Cognitive Information Processing (CIP), 2010 2nd International Workshop on. IEEE, pp. 388-392.

[25] R. Graham, "Old and new problems and results in ramsey theory," Horizons of Combinatorics, pp. 105-118, 2008.

[26] V. Rosta, "Ramsey theory applications," the electronic journal of combinatorics, pp. 1-43.

[27] M. Chan, Private communication, 2011.

[28] S. Radziszowski et al., "Small ramsey numbers," Electronic Journal of Combinatorics, vol. 1, p. 28, 1994.

[29] M. Dawande, P. Keskinocak, J. Swaminathan, and S. Tayur, "On bipartite and multipartite clique problems," Journal of Algorithms, vol. 41, no. 2, pp. 388-403, 2001.



